This tutorial includes three sections. First, it shows how to generate EXON, EPT, EAT and gene count matrix from a BAM file. Then, we run the current-state Seurat pipeline based on the gene expression matrix. This step will tell you how many cells and cell groups we can detect. In the last step, we use Yano to detect the alternative expressed EXONs, EPTs and EATs, and you will also learn how to interpret alternative expressed EPTs with alignment tracks. Before you get started, please ensure you have adequately complied with the following software or package in a standard R and Linux work environment.
We use a BAM file (~14G) of human glioblastoma multiforme from 10X Genomics’ resource page, and a GTF file from Ensembl. The GTF is used to annotate reads and EPTs in this demo. So if you use your own data in the test, please make sure to use the same GTF in whole practice.
The following codes run in Terminal.
wget -c https://cf.10xgenomics.com/samples/cell-exp/4.0.0/Parent_SC3v3_Human_Glioblastoma/Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam
wget -c https://cf.10xgenomics.com/samples/cell-exp/4.0.0/Parent_SC3v3_Human_Glioblastoma/Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam.bai
wget -c https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gzThe following codes run in Terminal.
These commands tell you how to generate the count files. But you donot need to run these steps now, the processed files already put in the working directory.
# Call expressed peak tags (EPTs)
PISA callept -tag CB -umi UB -o epts.bed Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam
# Annotate EPTs with gene annoation
PISA annobed -gtf gencode.v44.annotation.gtf.gz -o epts_anno.bed -skip-chrs epts.bed
# Annotate reads with EPTs, exon and gene annotation
PISA anno -gtf gencode.v44.annotation.gtf.gz -exon -bed epts.bed -tag EP -o anno.bam Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam
# Create EPT and gene expression output directory
mkdir ept
mkdir exp
mkdir exon
# Generate gene X cell counts
PISA count -tag CB -umi UB -outdir exp -anno-tag GN anno.bam
# Generate exon X cell counts
PISA count -tag CB -umi UB -outdir exon -anno-tag EX anno.bam
# Generate EPT X cell counts
PISA count -tag CB -umi UB -outdir ept -anno-tag EP anno.bamNow you can check if the files generate properly.
## chr1 14665 14787 . . - 1 WASH7P intron
## chr1 134935 135281 . . - 1 ENSG00000268903 exonintron
## chr1 187036 187159 . . - 1 WASH9P exonintron
## chr1 629103 629323 . . + 1 MTND1P23 exon
## chr1 629906 630021 . . - 1 ENSG00000230021 intron
## chr1 629906 630025 . . + 1 MTND2P28 exon
## chr1 631822 631943 . . + 1 MTCO1P12 exon
## chr1 632266 632703 . . + 1 MTCO1P12 exonintron
## chr1 633103 633310 . . + 1 MTCO2P12 exon
## chr1 633365 633463 . . + 1 MTCO2P12 exonintron
## barcodes.tsv.gz
## features.tsv.gz
## matrix.mtx.gz
## barcodes.tsv.gz
## features.tsv.gz
## matrix.mtx.gz
## barcodes.tsv.gz
## features.tsv.gz
## matrix.mtx.gz
The following codes run in R.
require(Seurat)
require(Yano)
require(dplyr) ## for processing table
require(DT) ## view table
count <- ReadPISA("./exp/")
# We use 1000 genes as a cutoff to filter valid cells or droplets for convenience.
# Some packages like DropletUtils and Cellbender may perform a more reliable selection.
obj <- CreateSeuratObject(count, min.features = 1000, min.cells = 10)
# We get 5164 raw cells
obj## An object of class Seurat
## 24730 features across 5164 samples within 1 assay
## Active assay: RNA (24730 features, 0 variable features)
## orig.ident nCount_RNA nFeature_RNA
## TTCCACGGTTCGGCTG-1 SeuratProject 30943 6812
## GTGCTTCGTAATGTGA-1 SeuratProject 33492 7048
## CACAGGCAGCTGGTGA-1 SeuratProject 40113 7787
## TGGCGTGGTGCAACGA-1 SeuratProject 30056 6849
## ATCTCTAGTCCATAGT-1 SeuratProject 31092 6879
## AGCGCCACAAGCGATG-1 SeuratProject 12433 4132
# Now check the gene from mitochondria and red blood cells
obj[["percent.mt"]] <- PercentageFeatureSet(obj, pattern = "^MT-")
obj[["percent.hg"]] <- PercentageFeatureSet(obj, pattern = "^HB[ABDEGQZ12]+$")
# Now new meta info added
head(obj@meta.data)## orig.ident nCount_RNA nFeature_RNA percent.mt percent.hg
## TTCCACGGTTCGGCTG-1 SeuratProject 30943 6812 6.980577 0
## GTGCTTCGTAATGTGA-1 SeuratProject 33492 7048 5.380389 0
## CACAGGCAGCTGGTGA-1 SeuratProject 40113 7787 6.160098 0
## TGGCGTGGTGCAACGA-1 SeuratProject 30056 6849 8.933324 0
## ATCTCTAGTCCATAGT-1 SeuratProject 31092 6879 5.940435 0
## AGCGCCACAAGCGATG-1 SeuratProject 12433 4132 2.718572 0
# You can visulise any meta information with VlnPlot
VlnPlot(obj, features = c("nFeature_RNA", "nCount_RNA", "percent.mt", "percent.hg"), ncol = 4)# Now filter the cell with `abnormal` high gene expression and mito gene expression
obj <- subset(obj, nFeature_RNA < 9000 & percent.mt < 20)
obj## An object of class Seurat
## 24730 features across 4973 samples within 1 assay
## Active assay: RNA (24730 features, 0 variable features)
# We only have the count matrix now, but to perform further analysis we need to normalize
# the data and to remove the bias introduced by sequence depth
obj[['RNA']]@counts[1:5,1:5]## 5 x 5 sparse Matrix of class "dgCMatrix"
## TTCCACGGTTCGGCTG-1 GTGCTTCGTAATGTGA-1 CACAGGCAGCTGGTGA-1
## ENSG00000238009 . . 1
## ENSG00000268903 . . .
## ENSG00000241860 . . .
## MTND1P23 . . .
## MTND2P28 . . .
## TGGCGTGGTGCAACGA-1 ATCTCTAGTCCATAGT-1
## ENSG00000238009 1 1
## ENSG00000268903 . .
## ENSG00000241860 . .
## MTND1P23 . .
## MTND2P28 1 .
# the `data` matrix used to store the normalized expression value, but now it's the same with raw counts
obj[['RNA']]@data[1:5,1:5]## 5 x 5 sparse Matrix of class "dgCMatrix"
## TTCCACGGTTCGGCTG-1 GTGCTTCGTAATGTGA-1 CACAGGCAGCTGGTGA-1
## ENSG00000238009 . . 1
## ENSG00000268903 . . .
## ENSG00000241860 . . .
## MTND1P23 . . .
## MTND2P28 . . .
## TGGCGTGGTGCAACGA-1 ATCTCTAGTCCATAGT-1
## ENSG00000238009 1 1
## ENSG00000268903 . .
## ENSG00000241860 . .
## MTND1P23 . .
## MTND2P28 1 .
# Log scale the gene counts, actually it's log10(counts/cell.size*scale.factor+1).
# Note: some other package like scran may use log10(counts+1)
obj <- NormalizeData(obj)
# The data matrix now updated
obj[['RNA']]@data[1:5,1:5]## 5 x 5 sparse Matrix of class "dgCMatrix"
## TTCCACGGTTCGGCTG-1 GTGCTTCGTAATGTGA-1 CACAGGCAGCTGGTGA-1
## ENSG00000238009 . . 0.22258
## ENSG00000268903 . . .
## ENSG00000241860 . . .
## MTND1P23 . . .
## MTND2P28 . . .
## TGGCGTGGTGCAACGA-1 ATCTCTAGTCCATAGT-1
## ENSG00000238009 0.2872162 0.2788629
## ENSG00000268903 . .
## ENSG00000241860 . .
## MTND1P23 . .
## MTND2P28 0.2872162 .
# Find the highly variable genes for downstream analysis
# We cannot use all genes for dimension reduction and clustering because two points:
# 1) it will be extremely slow
# 2) some genes may not so informative ..
# ref:
obj <- FindVariableFeatures(obj, selection.method = "vst", nfeatures = 2000)
# The scale step will scale gene expression with a mean value of 0.
# Usually only scale the HVGs.
# Some people like to scale all genes, but the scaled matrix is a dense matrix.
# So it will use a lot of memory and perhaps not necessary
obj <- ScaleData(obj, features = VariableFeatures(obj))
obj[['RNA']]@scale.data[1:5,1:5]## TTCCACGGTTCGGCTG-1 GTGCTTCGTAATGTGA-1 CACAGGCAGCTGGTGA-1
## HES4 -0.8057120 0.09033286 0.4755769926
## ISG15 -0.7324449 -0.73244491 0.0005621167
## TNFRSF4 -0.1925974 -0.19259736 -0.1925973591
## SAMD11 -0.1686980 4.90684781 -0.1686980112
## MXRA8 -0.4777915 0.20511186 1.9127692159
## TGGCGTGGTGCAACGA-1 ATCTCTAGTCCATAGT-1
## HES4 -0.09112569 -0.4150447
## ISG15 -0.35542388 0.5259282
## TNFRSF4 -0.19259736 -0.1925974
## SAMD11 -0.16869801 -0.1686980
## MXRA8 0.27292836 -0.4777915
obj <- RunPCA(obj, features = VariableFeatures(obj))
# Cluster cell groups
obj <- FindNeighbors(obj, dims = 1:10)
obj <- FindClusters(obj, resolution = 0.5)## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 4973
## Number of edges: 156733
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9032
## Number of communities: 13
## Elapsed time: 0 seconds
# Reduce feature space to 2D space
obj <- RunUMAP(obj, dims = 1:10)
DimPlot(obj, reduction = "umap")markers <- FindAllMarkers(obj, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
markers %>% group_by(cluster) %>% slice_max(n = 5, order_by = avg_log2FC) %>% datatable(options = list(scrollX = T))markers %>% group_by(cluster) %>% slice_max(n = 1, order_by = avg_log2FC) %>% pull(gene) %>% unique() -> sel# GSVA
DefaultAssay(obj) <- "RNA"
mat <- AggregateExpression(obj, assays = "RNA", slot="counts")
markers %>% group_by(cluster) %>% slice_max(n = 100, order_by = avg_log2FC) %>% pull(gene) %>% unique()->sel
mat <- mat$RNA[sel,]
dim(mat)## [1] 948 13
require(msigdbr) # to load database
require(GSVA)
require(ComplexHeatmap)
kegg.dat <- msigdbr(species = "Homo sapiens", category = "C2", subcategory = "CP:KEGG")
kegg.genes <- split(kegg.dat$gene_symbol, kegg.dat$gs_name)
go.dat <- msigdbr(species = "Homo sapiens", category = "C5") %>% filter(gs_subcat != "HPO")
go.genes <- split(go.dat$gene_symbol, go.dat$gs_name)
gsva.go.result <- gsva(expr=mat, gset.idx.list=go.genes, kcdf="Poisson", verbose=FALSE, parallel.sz = 16, mx.diff=1)
Heatmap(gsva.go.result)gsva.kegg.result <- gsva(expr=mat, gset.idx.list=kegg.genes, kcdf="Poisson", verbose=FALSE, parallel.sz = 16, mx.diff=1)
Heatmap(gsva.kegg.result)The following codes run in R.
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
## [1] "RNA" "EXON" "RNA_nn" "RNA_snn" "pca" "umap"
DefaultAssay(obj) <- "EXON"
# Check the meta data for features. Try to understand between meta.features and obj@meta.data
head(obj[['RNA']]@meta.features)## vst.mean vst.variance vst.variance.expected
## ENSG00000238009 0.002614116 0.002607807 0.002740098
## ENSG00000268903 0.012467324 0.013118871 0.014557967
## ENSG00000241860 0.005228232 0.005201944 0.005743268
## MTND1P23 0.007038005 0.006989877 0.007863548
## MTND2P28 0.125477579 0.143544246 0.171295349
## MTCO1P12 0.030363965 0.030252422 0.037907084
## vst.variance.standardized vst.variable
## ENSG00000238009 0.9517202 FALSE
## ENSG00000268903 0.9011471 FALSE
## ENSG00000241860 0.9057463 FALSE
## MTND1P23 0.8888961 FALSE
## MTND2P28 0.8379927 FALSE
## MTCO1P12 0.7980678 FALSE
## data frame with 0 columns and 6 rows
## Working on assay EXON
## chr start end strand gene_name
## chr1:135141-135895/-/ENSG00000268903 chr1 135141 135895 - ENSG00000268903
## chr1:629062-629433/+/MTND1P23 chr1 629062 629433 + MTND1P23
## chr1:629640-630683/+/MTND2P28 chr1 629640 630683 + MTND2P28
## chr1:631074-632616/+/MTCO1P12 chr1 631074 632616 + MTCO1P12
## chr1:632757-633438/+/MTCO2P12 chr1 632757 633438 + MTCO2P12
## chr1:633696-634376/+/MTATP6P1 chr1 633696 634376 + MTATP6P1
## Working on assay : EXON
## Build weights on pca
## Run autocorrelation test for 142112 features.
## 37627 autocorrelated features.
# This step may take a while; depending on the feature number and cell number, the runtime may range from seconds to hours.
# The default permutation step is 1000. It's probably too overwhelming. Here, we change perm to 100 to save time.
obj <- RunBlockCorr(obj, block.name = "gene_name", block.assay = "RNA", threads=8, perm=100)## Working on assay EXON
## Working on 37627 features.
## Build weights on pca
## Processing 8649 blocks..
## Trying to retrieve data from assay RNA..
## Test dissimlarity of two processes ..
## Runtime : 3.734613 mins
## chr start end strand gene_name
## chr1:135141-135895/-/ENSG00000268903 chr1 135141 135895 - ENSG00000268903
## chr1:629062-629433/+/MTND1P23 chr1 629062 629433 + MTND1P23
## chr1:629640-630683/+/MTND2P28 chr1 629640 630683 + MTND2P28
## chr1:631074-632616/+/MTCO1P12 chr1 631074 632616 + MTCO1P12
## chr1:632757-633438/+/MTCO2P12 chr1 632757 633438 + MTCO2P12
## chr1:633696-634376/+/MTATP6P1 chr1 633696 634376 + MTATP6P1
## MoransI MoransI.pval AutoCorrFeature
## chr1:135141-135895/-/ENSG00000268903 0.048799765 4.686871e-16 FALSE
## chr1:629062-629433/+/MTND1P23 -0.003440520 7.030929e-01 FALSE
## chr1:629640-630683/+/MTND2P28 0.055577603 6.064212e-20 FALSE
## chr1:631074-632616/+/MTCO1P12 0.016632877 3.054542e-03 FALSE
## chr1:632757-633438/+/MTCO2P12 0.008546932 7.652688e-02 FALSE
## chr1:633696-634376/+/MTATP6P1 0.527425008 0.000000e+00 TRUE
## gene_name.D gene_name.r gene_name.pval
## chr1:135141-135895/-/ENSG00000268903 NA NA NA
## chr1:629062-629433/+/MTND1P23 NA NA NA
## chr1:629640-630683/+/MTND2P28 NA NA NA
## chr1:631074-632616/+/MTCO1P12 NA NA NA
## chr1:632757-633438/+/MTCO2P12 NA NA NA
## chr1:633696-634376/+/MTATP6P1 NA NA NA
obj[['EXON']]@meta.features %>% filter(gene_name.pval < 0.001) %>% datatable(options = list(scrollX = T))# Normalize Exon expression for featureplot
obj <- NormalizeData(obj)
FeaturePlot(obj, features = c("chr11:123061588-123061833/-/HSPA8", "HSPA8"), order = TRUE, pt.size=1)## Warning: Could not find HSPA8 in the default search locations, found in RNA
## assay instead
## [2023-10-30 00:32:32] [32mGTF loading..[0m
## [2023-10-30 00:33:53] [32mLoad 62700 genes.[0m
## [2023-10-30 00:33:53] [32mLoad time : 80.858 sec[0m
## Done!
plotTracks(bamfile="./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam", db=db, gene="HSPA8",cell.group = Idents(obj), max.depth=500, highlights = c(123061588,123061833))## Chromosome chr11, start 123056489, end 123064230
## Process ./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam
The following codes run in R.
## [1] "RNA" "EXON" "RNA_nn" "RNA_snn" "pca" "umap"
ept <- ReadPISA("./ept")
# Now we create a new ASSAY for the Seurat Object, look this step very carefully,
# and understand the different between CreateSeuratObject and CreateAssayObject
obj[['EPT']] <- CreateAssayObject(ept[,colnames(obj)], min.cells = 30)
# A new EPT assay now created
names(obj)## [1] "RNA" "EXON" "EPT" "RNA_nn" "RNA_snn" "pca" "umap"
## [1] "EXON"
## data frame with 0 columns and 6 rows
# Annotate the EPTs with predefined annotation file
obj <- LoadEPTanno(file="epts_anno.bed", object=obj)## Working on assay EPT
## Intersect 245521 features.
## chr start end name strand n_gene
## chr1:14665-14787/- chr1 14665 14787 chr1:14665-14787/- - 1
## chr1:134935-135281/- chr1 134935 135281 chr1:134935-135281/- - 1
## chr1:629103-629323/+ chr1 629103 629323 chr1:629103-629323/+ + 1
## chr1:629906-630021/- chr1 629906 630021 chr1:629906-630021/- - 1
## chr1:629906-630025/+ chr1 629906 630025 chr1:629906-630025/+ + 1
## chr1:632266-632703/+ chr1 632266 632703 chr1:632266-632703/+ + 1
## gene_name type
## chr1:14665-14787/- WASH7P intron
## chr1:134935-135281/- ENSG00000268903 exonintron
## chr1:629103-629323/+ MTND1P23 exon
## chr1:629906-630021/- ENSG00000230021 intron
## chr1:629906-630025/+ MTND2P28 exon
## chr1:632266-632703/+ MTCO1P12 exonintron
## Working on assay : EPT
## Build weights on pca
## Run autocorrelation test for 245677 features.
## 17316 autocorrelated features.
## Working on assay EPT
## Working on 17316 features.
## Build weights on pca
## Processing 7796 blocks..
## Trying to retrieve data from assay RNA..
## Test dissimlarity of two processes ..
## Runtime : 1.711828 mins
## chr start end name strand n_gene
## chr1:14665-14787/- chr1 14665 14787 chr1:14665-14787/- - 1
## chr1:134935-135281/- chr1 134935 135281 chr1:134935-135281/- - 1
## chr1:629103-629323/+ chr1 629103 629323 chr1:629103-629323/+ + 1
## chr1:629906-630021/- chr1 629906 630021 chr1:629906-630021/- - 1
## chr1:629906-630025/+ chr1 629906 630025 chr1:629906-630025/+ + 1
## chr1:632266-632703/+ chr1 632266 632703 chr1:632266-632703/+ + 1
## gene_name type MoransI MoransI.pval
## chr1:14665-14787/- WASH7P intron 0.003183699 2.906980e-01
## chr1:134935-135281/- ENSG00000268903 exonintron 0.112081399 3.187695e-75
## chr1:629103-629323/+ MTND1P23 exon 0.009076548 6.354737e-02
## chr1:629906-630021/- ENSG00000230021 intron 0.007448170 1.061480e-01
## chr1:629906-630025/+ MTND2P28 exon 0.054112384 5.234265e-19
## chr1:632266-632703/+ MTCO1P12 exonintron 0.035879924 2.177869e-09
## AutoCorrFeature gene_name.D gene_name.r gene_name.pval
## chr1:14665-14787/- FALSE NA NA NA
## chr1:134935-135281/- TRUE NA NA NA
## chr1:629103-629323/+ FALSE NA NA NA
## chr1:629906-630021/- FALSE NA NA NA
## chr1:629906-630025/+ FALSE NA NA NA
## chr1:632266-632703/+ FALSE NA NA NA
# Color by EPT types
type.cols <- c("#131313", "blue", "#A6CEE3", "#1F78B4", "#B2DF8A", "#33A02C", "#FB9A99", "#E31A1C", "#FDBF6F", "#FF7F00", "#CAB2D6", "#6A3D9A", "#FFFF99", "#B15928")
names(type.cols) <- c("antisense_complex", "antisense_exon", "antisense_intron", "antisense_utr3", "antisense_utr5", "exon","exonintron", "intergenic", "intron", "multiexons", "multigenes","utr3", "utr5", "whole_gene")
FbtPlot(obj, val = "gene_name.pval", col.by = "type", cols = type.cols, size=2)df <- obj[['EPT']]@meta.features
# use %>% from dplyr
df %>% filter(gene_name.pval < 1e-10) %>% datatable(options = list(scrollX = T))# Normalize EPT counts for visulization
obj <- NormalizeData(obj)
FeaturePlot(obj, features = c("chr19:16095200-16095559/+", "TPM4"), order=TRUE, pt.size = 2)## Warning: Could not find TPM4 in the default search locations, found in RNA
## assay instead
plotTracks(bamfile="./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam", db=db, gene="TPM4",cell.group = Idents(obj), meta.features = obj[['EPT']]@meta.features, highlights = c(16095200,16095559), type.col = type.cols)## Chromosome chr19, start 16066021, end 16104002
## Process ./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam
# Capped max normalized depth to 1000
plotTracks(bamfile="./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam", db=db, gene="TPM4",cell.group = Idents(obj), meta.features = obj[['EPT']]@meta.features, highlights = c(16095200,16095559), type.col = type.cols, max.depth = 1000)## Chromosome chr19, start 16066021, end 16104002
## Process ./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam
## Warning: Could not find RBKS in the default search locations, found in RNA
## assay instead
plotTracks(bamfile="./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam", db=db, gene="RBKS",cell.group = Idents(obj), meta.features = obj[['EPT']]@meta.features, highlights = c(27889096,27890852), type.col = type.cols)## Chromosome chr2, start 27780379, end 27891681
## Process ./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam
# Log-scale normalized UMI depth
plotTracks(bamfile="./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam", db=db, gene="RBKS",cell.group = Idents(obj), meta.features = obj[['EPT']]@meta.features, highlights = c(27889096,27890852), type.col = type.cols, log.scaled = TRUE)## Chromosome chr2, start 27780379, end 27891681
## Process ./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam
exon.meta <-obj[['EXON']]@meta.features
ept.meta <- obj[['EPT']]@meta.features
alt.exon <- exon.meta %>% filter(gene_name.pval < 0.01) %>% pull(gene_name) %>% unique()
alt.ept <- ept.meta %>% filter(gene_name.pval < 0.01 & type %in% c("exon", "utr3", "utr5", "exonintron", "multiexons")) %>% pull(gene_name) %>% unique()
require(ggvenn)
ggvenn(list("EXON"=alt.exon, "EPT"=alt.ept),text_size = 8)## [1] "ENSG00000288663" "MYL6" "SERF2" "RPL17-C18orf32"
## [5] "ENSG00000268173"
## [1] "PDE4B" "NME7" "BAG3" "CD163L1"
## [5] "SIPA1L1" "SNAP23" "ANXA2" "ENSG00000285505"
## [9] "C5AR1" "RBKS" "MXD1" "NFE2L2"
## [13] "CREB1" "NFATC2" "RPL15" "MAPK10"
## [17] "PHACTR1" "DDAH2" "SOD2" "PILRA"
## [21] "HAS2-AS1"
exon.meta %>% filter(gene_name %in% setdiff(alt.exon, alt.ept) & gene_name.pval < 0.01) %>% datatable(options = list(scrollX = T))DefaultAssay(obj) <- "EXON"
FeaturePlot(obj, features = c("chr12:53303034-53306861/+/ENSG00000288663","chr12:53306680-53306861/+/ENSG00000288663","ENSG00000288663"), order = TRUE, pt.size = 2, ncol=3)## Warning: Could not find ENSG00000288663 in the default search locations, found
## in RNA assay instead
FeaturePlot(obj, features = c("chr12:53303034-53306861/+/ENSG00000288663","chr12:53306680-53306861/+/ENSG00000288663","ENSG00000288663"), order = TRUE, pt.size = 2, ncol=3, cols=c("blue","white","red"))## Warning: Could not find ENSG00000288663 in the default search locations, found
## in RNA assay instead
plotTracks(bamfile="./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam", db=db, gene="ENSG00000288663",cell.group = Idents(obj), meta.features = obj[['EPT']]@meta.features, highlights = c(53303034,53306861), type.col = type.cols, max.depth = 1000)## Chromosome chr12, start 53294492, end 53308174
## Process ./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam
## Warning: Could not find MYL6 in the default search locations, found in RNA
## assay instead
plotTracks(bamfile="./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam", db=db, gene="MYL6",cell.group = Idents(obj), meta.features = obj[['EPT']]@meta.features, highlights = c(56161387,56161465), type.col = type.cols, max.depth = 8000)## Chromosome chr12, start 56157346, end 56164496
## Process ./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam
ept.meta %>% filter(gene_name %in% setdiff(alt.ept,alt.exon) & gene_name.pval < 0.01 & type %in% c("exon","utr3","utr5","exonintron","multiexons")) %>% datatable(options = list(scrollX = T))DefaultAssay(obj) <- "EPT"
FeaturePlot(obj, features = c("chr1:65992643-65994170/+", "PDE4B"), pt.size = 2, order = TRUE)## Warning: Could not find PDE4B in the default search locations, found in RNA
## assay instead
plotTracks(bamfile="./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam", db=db, gene="PDE4B",cell.group = Idents(obj), meta.features = obj[['EPT']]@meta.features, highlights = c(65992643,65994170), type.col = type.cols, max.depth = 100)## Chromosome chr1, start 65791514, end 66375579
## Process ./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam
The following EAT analysis is based on the genetic variants. Therefore, we need to call variants from RNA reads first. The genetic variants in RNA sequence may come from three inputs:
[1] Germline variants; [2] Somatic mutations; [3] RNA editings.
Tips: for inbreeding mouse species, genetic variants are likely come from somatic mutation and RNA editings, considering very few genetic heterogeneity variants.
The following codes run in Terminal.
# download genome reference
wget -c https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/GRCh38.p14.genome.fa.gz
gzip -d /GRCh38.p14.genome.fa.gz
# Variant calling. Here I use bcftools to call variants, other tools like GATK also appliable. This step usually be very slow.
# Donot filter variants, use raw VCF for downstream analysis. Because we are trying to label reads and generate EAT counts,
# low quality variants will be filtered during downstream QC.
bcftools mpileup -Ou -f ./GRCh38.p14.genome.fa ./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam -d 10000 | bcftools call -vmO z -o var.vcf.gz
PISA anno -vcf var.vcf.gz -vcf-ss -ref-alt -vtag VF -o anno.bam ./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam
mkdir var
PISA count -tag CB -umi UB -anno-tag VF -outdir var anno.bam## barcodes.tsv.gz
## features.tsv.gz
## matrix.mtx.gz
The following codes run in R.
var <- ReadPISA("./var")
obj[['VAR']] <- CreateAssayObject(var[,colnames(obj)], min.cells = 10)
DefaultAssay(obj) <- "VAR"
obj <- NormalizeData(obj)
obj <- LoadVARanno(object = obj, file = "epts_anno.bed")## Working on assay VAR
## chr start strand locus ept
## chr1:631862G>A/+ chr1 631862 + chr1:631862/+ chr1:631822-631943/+
## chr1:633192G>A/+ chr1 633192 + chr1:633192/+ chr1:633103-633310/+
## chr1:633192G=/+ chr1 633192 + chr1:633192/+ chr1:633103-633310/+
## chr1:633236A>G/+ chr1 633236 + chr1:633236/+ chr1:633103-633310/+
## chr1:633300A=/+ chr1 633300 + chr1:633300/+ chr1:633103-633310/+
## chr1:633372T>C/+ chr1 633372 + chr1:633372/+ chr1:633365-633463/+
## gene_name ept_type type
## chr1:631862G>A/+ MTCO1P12 exon alt
## chr1:633192G>A/+ MTCO2P12 exon alt
## chr1:633192G=/+ MTCO2P12 exon ref
## chr1:633236A>G/+ MTCO2P12 exon alt
## chr1:633300A=/+ MTCO2P12 exon ref
## chr1:633372T>C/+ MTCO2P12 exonintron alt
## Working on assay : VAR
## Build weights on pca
## Run autocorrelation test for 112197 features.
## 14881 autocorrelated features.
## Working on assay VAR
## Working on 14881 features.
## Build weights on pca
## Processing 5500 blocks..
## Aggregate counts..
## Test dissimlarity of two processes ..
## Runtime : 50.46148 secs
## chr start strand locus ept
## chr1:631862G>A/+ chr1 631862 + chr1:631862/+ chr1:631822-631943/+
## chr1:633192G>A/+ chr1 633192 + chr1:633192/+ chr1:633103-633310/+
## chr1:633192G=/+ chr1 633192 + chr1:633192/+ chr1:633103-633310/+
## chr1:633236A>G/+ chr1 633236 + chr1:633236/+ chr1:633103-633310/+
## chr1:633300A=/+ chr1 633300 + chr1:633300/+ chr1:633103-633310/+
## chr1:633372T>C/+ chr1 633372 + chr1:633372/+ chr1:633365-633463/+
## gene_name ept_type type MoransI MoransI.pval
## chr1:631862G>A/+ MTCO1P12 exon alt -0.003823755 0.729139599
## chr1:633192G>A/+ MTCO2P12 exon alt -0.002721279 0.666664863
## chr1:633192G=/+ MTCO2P12 exon ref 0.004190248 0.232391666
## chr1:633236A>G/+ MTCO2P12 exon alt 0.014472296 0.006542342
## chr1:633300A=/+ MTCO2P12 exon ref 0.003431257 0.271878196
## chr1:633372T>C/+ MTCO2P12 exonintron alt -0.003214307 0.696233522
## AutoCorrFeature locus.D locus.r locus.pval
## chr1:631862G>A/+ FALSE NA NA NA
## chr1:633192G>A/+ FALSE NA NA NA
## chr1:633192G=/+ FALSE NA NA NA
## chr1:633236A>G/+ FALSE NA NA NA
## chr1:633300A=/+ FALSE NA NA NA
## chr1:633372T>C/+ FALSE NA NA NA
## chr start strand locus
## chr10:17237326C=/+ chr10 17237326 + chr10:17237326/+
## ept gene_name ept_type type MoransI
## chr10:17237326C=/+ chr10:17236750-17238079/+ VIM utr3 ref 0.6508988
## MoransI.pval AutoCorrFeature locus.D locus.r locus.pval
## chr10:17237326C=/+ 0 TRUE 0.677191 0.1854308 3.551327e-38
## [1] "chr10:17237326C>T/+" "chr10:17237326C=/+"
# Question:
# How to distinguish chr10:17237326C>T/+ is a somatic mutation or germline mutation or RNA editing?
FeaturePlot(obj, features = c("chr10:17237326C>T/+", "chr10:17237326C=/+" ), order=TRUE, pt.size = 2)## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
## LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_DK.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_DK.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_DK.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_DK.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ggvenn_0.1.10 ggplot2_3.4.4 ComplexHeatmap_2.15.3
## [4] GSVA_1.40.1 msigdbr_7.5.1 DT_0.30
## [7] dplyr_1.1.3 Yano_0.0.0.9999 Matrix_1.6-1.1
## [10] SeuratObject_4.1.4 Seurat_4.4.0
##
## loaded via a namespace (and not attached):
## [1] utf8_1.2.4 spatstat.explore_3.2-5
## [3] reticulate_1.34.0 R.utils_2.12.2
## [5] tidyselect_1.2.0 RSQLite_2.3.1
## [7] AnnotationDbi_1.56.2 htmlwidgets_1.6.2
## [9] BiocParallel_1.28.3 Rtsne_0.16
## [11] ScaledMatrix_1.2.0 munsell_0.5.0
## [13] codetools_0.2-18 ica_1.0-3
## [15] future_1.33.0 miniUI_0.1.1.1
## [17] withr_2.5.1 spatstat.random_3.2-1
## [19] colorspace_2.1-0 progressr_0.14.0
## [21] Biobase_2.54.0 knitr_1.44
## [23] rstudioapi_0.15.0 stats4_4.1.2
## [25] SingleCellExperiment_1.16.0 ROCR_1.0-11
## [27] tensor_1.5 listenv_0.9.0
## [29] MatrixGenerics_1.6.0 labeling_0.4.3
## [31] GenomeInfoDbData_1.2.7 polyclip_1.10-6
## [33] bit64_4.0.5 farver_2.1.1
## [35] pheatmap_1.0.12 rhdf5_2.38.1
## [37] parallelly_1.36.0 vctrs_0.6.4
## [39] generics_0.1.3 xfun_0.40
## [41] doParallel_1.0.17 R6_2.5.1
## [43] GenomeInfoDb_1.30.1 clue_0.3-64
## [45] ggbeeswarm_0.7.2 rsvd_1.0.5
## [47] bitops_1.0-7 rhdf5filters_1.6.0
## [49] spatstat.utils_3.0-3 cachem_1.0.8
## [51] DelayedArray_0.20.0 promises_1.2.1
## [53] scales_1.2.1 beeswarm_0.4.0
## [55] gtable_0.3.4 beachmat_2.10.0
## [57] Cairo_1.6-1 globals_0.16.2
## [59] goftest_1.2-3 rlang_1.1.1
## [61] GlobalOptions_0.1.2 splines_4.1.2
## [63] lazyeval_0.2.2 spatstat.geom_3.2-7
## [65] yaml_2.3.7 reshape2_1.4.4
## [67] abind_1.4-5 crosstalk_1.2.0
## [69] httpuv_1.6.12 tools_4.1.2
## [71] nabor_0.5.0 ellipsis_0.3.2
## [73] jquerylib_0.1.4 RColorBrewer_1.1-3
## [75] BiocGenerics_0.40.0 ggridges_0.5.4
## [77] Rcpp_1.0.11 plyr_1.8.9
## [79] sparseMatrixStats_1.6.0 zlibbioc_1.40.0
## [81] purrr_1.0.2 RCurl_1.98-1.12
## [83] deldir_1.0-9 GetoptLong_1.0.5
## [85] pbapply_1.7-2 cowplot_1.1.1
## [87] S4Vectors_0.32.4 zoo_1.8-12
## [89] SummarizedExperiment_1.24.0 ggrepel_0.9.4
## [91] cluster_2.1.2 magrittr_2.0.3
## [93] magick_2.8.1 data.table_1.14.8
## [95] scattermore_1.2 circlize_0.4.15
## [97] lmtest_0.9-40 RANN_2.6.1
## [99] fitdistrplus_1.1-11 matrixStats_1.0.0
## [101] patchwork_1.1.3 mime_0.12
## [103] evaluate_0.22 xtable_1.8-4
## [105] XML_3.99-0.14 shape_1.4.6
## [107] IRanges_2.28.0 gridExtra_2.3
## [109] compiler_4.1.2 tibble_3.2.1
## [111] KernSmooth_2.23-20 crayon_1.5.2
## [113] R.oo_1.25.0 htmltools_0.5.6.1
## [115] later_1.3.1 tidyr_1.3.0
## [117] DBI_1.1.3 MASS_7.3-55
## [119] babelgene_22.9 cli_3.6.1
## [121] R.methodsS3_1.8.2 parallel_4.1.2
## [123] igraph_1.5.1 GenomicRanges_1.46.1
## [125] pkgconfig_2.0.3 sp_2.1-1
## [127] plotly_4.10.3 spatstat.sparse_3.0-2
## [129] foreach_1.5.2 annotate_1.70.0
## [131] vipor_0.4.5 bslib_0.5.1
## [133] XVector_0.34.0 stringr_1.5.0
## [135] digest_0.6.33 sctransform_0.4.1
## [137] RcppAnnoy_0.0.21 graph_1.70.0
## [139] spatstat.data_3.0-1 Biostrings_2.60.2
## [141] rmarkdown_2.25 leiden_0.4.3
## [143] uwot_0.1.16 DelayedMatrixStats_1.16.0
## [145] GSEABase_1.54.0 shiny_1.7.5.1
## [147] gtools_3.9.4 rjson_0.2.21
## [149] lifecycle_1.0.3 nlme_3.1-155
## [151] jsonlite_1.8.7 Rhdf5lib_1.16.0
## [153] viridisLite_0.4.2 limma_3.50.3
## [155] fansi_1.0.5 pillar_1.9.0
## [157] lattice_0.20-45 ggrastr_1.0.2
## [159] KEGGREST_1.34.0 fastmap_1.1.1
## [161] httr_1.4.7 survival_3.5-5
## [163] glue_1.6.2 iterators_1.0.14
## [165] png_0.1-8 bit_4.0.5
## [167] stringi_1.7.12 sass_0.4.7
## [169] HDF5Array_1.22.1 blob_1.2.4
## [171] BiocSingular_1.10.0 memoise_2.0.1
## [173] irlba_2.3.5.1 future.apply_1.11.0